Skew Class-balanced Re-weighting for Unbiased Scene Graph Generation
An unbiased scene graph generation (SGG) algorithm referred to as Skew
Class-balanced Re-weighting (SCR) is proposed to address the biased predicate
predictions caused by the long-tailed distribution. Prior works focus mainly on
alleviating the deteriorating performance of minority predicate predictions, at
the cost of drastically dropping recall scores, i.e., losing the majority
predicate performance. The trade-off between majority and minority predicate
performance on the limited SGG datasets has not yet been correctly analyzed. In
this paper, to alleviate this issue, the Skew Class-balanced Re-weighting (SCR)
loss function is proposed for unbiased SGG models. Leveraging the skewness of
biased predicate predictions, SCR estimates the target predicate weight
coefficients and then assigns larger weights to the biased predicates for a
better trade-off between the majority predicates and the minority ones.
Extensive experiments conducted on the standard Visual Genome dataset and Open
Images V4 & V6 demonstrate the performance and generality of SCR with
traditional SGG models.
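The abstract does not give SCR's exact skew-based weight estimator, so the following is only a minimal sketch of the general idea of class-balanced re-weighting for a long-tailed predicate distribution: rarer classes receive larger loss weights. The inverse-frequency weighting used here is a hypothetical stand-in for SCR's estimated coefficients, not the paper's formula.

```python
import numpy as np

def class_weights_inverse_freq(counts, smooth=1.0):
    # Hypothetical stand-in for SCR's skew-estimated coefficients:
    # rarer predicate classes get larger weights.
    counts = np.asarray(counts, dtype=float)
    w = counts.sum() / (counts + smooth)
    return w / w.mean()  # normalize so the average weight is 1

def weighted_cross_entropy(logits, labels, weights):
    # Re-weighted cross-entropy: each sample's loss is scaled by the
    # weight of its ground-truth class.
    logits = logits - logits.max(axis=1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    per_sample = -logp[np.arange(len(labels)), labels]
    return float((weights[labels] * per_sample).mean())
```

With a head class of 900 samples and a tail class of 10, the tail class receives roughly two orders of magnitude more weight, which is the basic trade-off mechanism the abstract describes.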
Maximum margin learning of t-SPNs for cell classification with filtered input
An algorithm based on a deep probabilistic architecture referred to as a
tree-structured sum-product network (t-SPN) is considered for cell
classification. The t-SPN is constructed such that the unnormalized probability
is represented as conditional probabilities over a subset of the most similar
cell classes. The constructed t-SPN architecture is learned by maximizing the
margin, which is the difference in conditional probability between the true
label and the most competitive false label. To enhance the generalization
ability of the architecture, L2 regularization (REG) is considered along with
the maximum margin (MM) criterion in the learning process. To highlight cell
features, this paper investigates the effectiveness of two generic high-pass
filters: ideal high-pass filtering and Laplacian of Gaussian (LOG) filtering.
On both the HEp-2 and Feulgen benchmark datasets, the t-SPN architecture
learned with the max-margin criterion and regularization produced the highest
accuracy rate compared to other state-of-the-art algorithms, including
convolutional neural network (CNN) based algorithms. The ideal high-pass filter
was more effective on the HEp-2 dataset, which is based on immunofluorescence
staining, while LOG filtering was more effective on the Feulgen dataset, which
is based on Feulgen staining.
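Of the two filters the abstract compares, the ideal high-pass filter has a standard frequency-domain definition: zero out all Fourier components within a cutoff radius of the DC term and invert the transform. A minimal sketch (the cutoff value and image sizes here are illustrative, not the paper's settings):

```python
import numpy as np

def ideal_highpass(img, cutoff):
    # Ideal high-pass filter: remove all frequency components whose
    # distance from the DC term is <= cutoff, then invert the FFT.
    F = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.mgrid[:h, :w]
    dist = np.hypot(yy - h / 2, xx - w / 2)
    F[dist <= cutoff] = 0.0
    return np.real(np.fft.ifft2(np.fft.ifftshift(F)))
```

Because the DC component is removed, the filtered image has zero mean; a constant image is mapped to all zeros, which is why such filters suppress flat staining background and highlight cell edges.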
Structured Co-reference Graph Attention for Video-grounded Dialogue
A video-grounded dialogue system referred to as Structured Co-reference Graph
Attention (SCGA) is presented for decoding the answer sequence to a question
regarding a given video while keeping track of the dialogue context. Although
recent efforts have made great strides in improving the quality of the
response, performance is still far from satisfactory. The two main challenges
are as follows: (1) how to deduce co-references among multiple modalities and
(2) how to reason over the rich underlying semantic structure of video with
complex spatial and temporal dynamics. To this end, SCGA is based on (1) a
Structured Co-reference Resolver that performs dereferencing by building a
structured graph over multiple modalities, and (2) a Spatio-temporal Video
Reasoner that captures local-to-global dynamics of video via gradually
neighboring graph attention. SCGA makes use of a pointer network to dynamically
replicate parts of the question when decoding the answer sequence. The validity
of the proposed SCGA is demonstrated on the AVSD@DSTC7 and AVSD@DSTC8 datasets,
challenging video-grounded dialogue benchmarks, and on the TVQA dataset, a
large-scale videoQA benchmark. Our empirical results show that SCGA outperforms
other state-of-the-art dialogue systems on these benchmarks, while extensive
ablation studies and qualitative analysis reveal the performance gains and
improved interpretability. Comment: Accepted to AAAI202
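The pointer mechanism the abstract mentions, copying parts of the question into the answer, can be sketched as a single decoding step: score each question position against the decoder state and copy the most attended token. This is a generic pointer-network step under assumed toy dimensions, not SCGA's actual architecture; `dec_state` and `question_enc` are hypothetical names.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def pointer_step(dec_state, question_enc, question_tokens):
    # Attention scores between the decoder state and each encoded
    # question position; the pointer copies the most attended token.
    scores = question_enc @ dec_state
    probs = softmax(scores)
    return question_tokens[int(np.argmax(probs))], probs
```

In a full decoder this copy distribution would typically be mixed with a vocabulary distribution, so the model can either generate a word or replicate one from the question.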
Sample-efficient Reinforcement Learning Representation Learning with Curiosity Contrastive Forward Dynamics Model
Developing a reinforcement learning (RL) agent capable of performing complex
control tasks directly from high-dimensional observations such as raw pixels
remains a challenge, as efforts continue toward improving sample efficiency and
generalization. This paper considers a learning framework, the Curiosity
Contrastive Forward Dynamics Model (CCFDM), for achieving more sample-efficient
RL directly from raw pixels. CCFDM incorporates a forward dynamics model (FDM)
and performs contrastive learning to train its deep convolutional neural
network-based image encoder (IE) to extract spatial and temporal information
conducive to more sample-efficient RL. In addition, during training, CCFDM
provides intrinsic rewards, produced from the FDM prediction error, that
encourage the curiosity of the RL agent and improve exploration. The diverse
and less repetitive observations provided by both our exploration strategy and
the data augmentation available in contrastive learning improve not only sample
efficiency but also generalization. Existing model-free RL methods such as Soft
Actor-Critic, when built on top of CCFDM, outperform prior state-of-the-art
pixel-based RL methods on the DeepMind Control Suite benchmark.
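The curiosity signal described above, an intrinsic reward derived from the forward dynamics model's prediction error, can be illustrated in a few lines. The linear forward model and the weight matrix `W` here are hypothetical simplifications; in CCFDM the FDM operates on learned latent encodings of pixel observations.

```python
import numpy as np

def forward_model(state, action, W):
    # Hypothetical linear forward dynamics model predicting the next
    # latent state from the current state and action.
    return W @ np.concatenate([state, action])

def intrinsic_reward(state, action, next_state, W, scale=1.0):
    # Curiosity bonus: squared prediction error of the forward model.
    # Transitions the model predicts poorly (novel ones) earn a larger
    # bonus, pushing the agent to explore them.
    pred = forward_model(state, action, W)
    return scale * float(np.sum((pred - next_state) ** 2))
```

A transition the model already predicts perfectly yields zero bonus, while a surprising transition yields a positive one, which is how the prediction error steers exploration toward novel observations.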